This is a study case used as a capstone for the Google Data Analytics certificate Problem: My favorite activity in the world is sleeping. Could I optimize it by controlling the different variables that might affect it?
Solution: Let’s dive into a sleeping study to discover which variables will improve or ruin a good night sleep!
library(tidyverse)
library(ggplot2)
library(patchwork)
library(ggpubr)
data <- read.csv("sleepdata.csv")
head(data)
## Person.ID Gender Age Occupation Sleep.Duration Quality.of.Sleep
## 1 1 Male 27 Software Engineer 6.1 6
## 2 2 Male 28 Doctor 6.2 6
## 3 3 Male 28 Doctor 6.2 6
## 4 4 Male 28 Sales Representative 5.9 4
## 5 5 Male 28 Sales Representative 5.9 4
## 6 6 Male 28 Software Engineer 5.9 4
## Physical.Activity.Level Stress.Level BMI.Category Blood.Pressure Heart.Rate
## 1 42 6 Overweight 126/83 77
## 2 60 8 Normal 125/80 75
## 3 60 8 Normal 125/80 75
## 4 30 8 Obese 140/90 85
## 5 30 8 Obese 140/90 85
## 6 30 8 Obese 140/90 85
## Daily.Steps Sleep.Disorder
## 1 4200 None
## 2 10000 None
## 3 10000 None
## 4 3000 Sleep Apnea
## 5 3000 Sleep Apnea
## 6 3000 Insomnia
summary(data)
## Person.ID Gender Age Occupation
## Min. : 1.00 Length:374 Min. :27.00 Length:374
## 1st Qu.: 94.25 Class :character 1st Qu.:35.25 Class :character
## Median :187.50 Mode :character Median :43.00 Mode :character
## Mean :187.50 Mean :42.18
## 3rd Qu.:280.75 3rd Qu.:50.00
## Max. :374.00 Max. :59.00
## Sleep.Duration Quality.of.Sleep Physical.Activity.Level Stress.Level
## Min. :5.800 Min. :4.000 Min. :30.00 Min. :3.000
## 1st Qu.:6.400 1st Qu.:6.000 1st Qu.:45.00 1st Qu.:4.000
## Median :7.200 Median :7.000 Median :60.00 Median :5.000
## Mean :7.132 Mean :7.313 Mean :59.17 Mean :5.385
## 3rd Qu.:7.800 3rd Qu.:8.000 3rd Qu.:75.00 3rd Qu.:7.000
## Max. :8.500 Max. :9.000 Max. :90.00 Max. :8.000
## BMI.Category Blood.Pressure Heart.Rate Daily.Steps
## Length:374 Length:374 Min. :65.00 Min. : 3000
## Class :character Class :character 1st Qu.:68.00 1st Qu.: 5600
## Mode :character Mode :character Median :70.00 Median : 7000
## Mean :70.17 Mean : 6817
## 3rd Qu.:72.00 3rd Qu.: 8000
## Max. :86.00 Max. :10000
## Sleep.Disorder
## Length:374
## Class :character
## Mode :character
##
##
##
#Number of null values per column
colSums(is.na(data))
## Person.ID Gender Age
## 0 0 0
## Occupation Sleep.Duration Quality.of.Sleep
## 0 0 0
## Physical.Activity.Level Stress.Level BMI.Category
## 0 0 0
## Blood.Pressure Heart.Rate Daily.Steps
## 0 0 0
## Sleep.Disorder
## 0
gen_avg <- data %>%
group_by(Gender) %>%
summarise(mean_age = mean(Age), n = n())
gen_avg
## # A tibble: 2 × 3
## Gender mean_age n
## <chr> <dbl> <int>
## 1 Female 47.4 185
## 2 Male 37.1 189
The average age of Women is higher. This could skew the data as age could be an important factor in sleep quality. But we do have a 50/50 gender representation.
data$Sleep.Disorder = factor(data$Sleep.Disorder, levels = c('None','Insomnia','Sleep Apnea'))
df <- data %>%
group_by(Sleep.Disorder) %>% # Variable to be transformed
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
ggplot(df, aes(x = "", y = perc, fill = Sleep.Disorder)) +
geom_col() +
geom_label(aes(label = labels), color = "black",
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_fill_brewer(palette = "Reds") +
coord_polar("y", start = 0) +
theme_void() +
ggtitle("Percentage of Sleep Disorders") +
theme(plot.title = element_text(hjust=0.5))
We do not have an equal representation of every sleep class, but we almost have a 60/40 representation of healthy and disease respectively.
my_comparisons <- list( c("None", "Insomnia"),
c("Insomnia", "Sleep Apnea"),
c("None", "Sleep Apnea") )
ggplot(data, aes(x=Sleep.Disorder, y=Quality.of.Sleep, fill=Sleep.Disorder)) +
geom_boxplot()+
geom_jitter(shape=16, position=position_jitter(0.2)) +
scale_fill_brewer(palette="Reds") +
stat_summary(fun = "mean", geom = "point", shape = 8,
size = 2, color = "white")+
stat_compare_means(comparisons = my_comparisons)+ # Add pairwise comparisons p-value
stat_compare_means(label.y = 12) + # Add global p-value
ggtitle("Quality fo Sleep score distribution \n per Sleep Disorder") +
theme(plot.title = element_text(hjust=0.5))
The average quality of aleep score for a person without a disorder is ~7.5. Sleep apnea does not seem to be correlated with a lower quality of sleep, we can see no significant difference between the Sleep Apnea and No Disorder classes. Insomnia is the only class where we see a significant loss in quality of sleep. The quality of sleep metric seems to be very subjective as it does not definitely indicate a sleep disorder or not.
gend_group <- data %>%
group_by(Gender,Sleep.Disorder,Age)
gender_bar <- ggplot(data, aes(x=Gender, y=Age, fill=Sleep.Disorder)) +
scale_fill_brewer(palette="Reds") +
geom_bar(stat="summary", fun.y = "mean", position="dodge") +
ggtitle("Sleep Disorders per Gender")+
theme(plot.title = element_text(hjust=0.5))
gender_bar
As stated before, the female group has a higher average age and seems more affected by sleep apnea that the male group, while the male group seems the most affected by insomnia.
data$BMI.Category <- gsub("Normal Weight", "Normal", data$BMI.Category)
data$BMI.Category <- factor(data$BMI.Category, labels=c('Normal','Overweight','Obese'))
bmi_bar <- ggplot(data, aes(x=BMI.Category, fill=Sleep.Disorder)) +
scale_fill_brewer(palette="Reds") +
geom_bar() +
ggtitle("BMI effect on Sleep") +
theme(plot.title = element_text(hjust=0.5))
bmi_bar
We cleaned up this column by merging “Normal” and “Normal Weight” samples. The overweight category seems severely under represented.
A conclusion we can pull from this figure is that BMI is correlated with sleep quality. A higher BMI tends to indicate insomnia and/or sleep apnea, while in the lowest BMI category has a majority of normal sleep. Obese class has almost 50/50 split on both sleep disorders. We cannot make an conclusions on the Overweight category due to the small sample size but it does seem to indicate a correlation with sleep disorders.
data_stress <- data %>%
group_by(Occupation) %>%
summarize(avg_stress = mean(Stress.Level),
avg_age = mean(Age),
count = n())
occup_point <- ggplot(data, aes(x=Occupation, y=Stress.Level)) +
geom_point(aes(fill=Sleep.Disorder,size=Age), color='black', shape=21, stroke=0.4) +
scale_fill_brewer(palette="Reds") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5),
plot.margin = margin(t = 5, r = 10, b = 5, l = 10),
axis.title.x = element_text(margin = margin(t = 20)),
plot.title = element_text(hjust=0.5)) +
ggtitle("Occupation, Age and Stress \n effect on Sleep")
occup_bar <- ggplot(data, aes(x=Occupation, fill=Sleep.Disorder)) +
geom_bar() +
scale_fill_brewer(palette="Reds")+
ggtitle("Occupation \n effect on Sleep") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.7),
plot.margin = margin(t = 5, r = 10, b = 5, l = 10),
axis.title.x = element_text(margin = margin(t = 20)),
plot.title = element_text(hjust=0.5))
print(data_stress)
## # A tibble: 11 × 4
## Occupation avg_stress avg_age count
## <chr> <dbl> <dbl> <int>
## 1 Accountant 4.59 39.6 37
## 2 Doctor 6.73 32.7 71
## 3 Engineer 3.89 46.6 63
## 4 Lawyer 5.06 39.4 47
## 5 Manager 5 45 1
## 6 Nurse 5.55 51.8 73
## 7 Sales Representative 8 28 2
## 8 Salesperson 7 43.5 32
## 9 Scientist 7 33.5 4
## 10 Software Engineer 6 31.2 4
## 11 Teacher 4.53 41.7 40
occup_bar + occup_point + plot_layout(widths = c(1, 1))
The occupations with the highest stress scores seem to be the most
correlated with sleep disorders, as illustrated by Sales representative,
Salespersons and Scientists, especially causing sleep apnea. The
occupations with a lower average stress score but a higher age group
average, like nurses and teachers, also show an increase in the number
of sleep disorders, even when the stress score is on the low end (see
Nurses with score of 3).
BP_ranges = c('Normal', 'Elevated', 'Hypertension_1', 'Hypertension_2', 'Hypertensive_Crisis')
BP_systolic_limits = list(c(0,120),c(120,130),c(130,140),c(140,180),c(180,200))
BP_diastolic_limits = list(c(0,80),c(0,80),c(80,90),c(90,120),c(120,140))
# Categorize into BP ranges depending on the systolic and diastolic values
data <- data %>%
separate(Blood.Pressure, into=c('Systolic','Diastolic'), sep="/") %>%
mutate(Systolic = as.numeric(Systolic),
Diastolic = as.numeric(Diastolic)) %>%
mutate(BP_range = case_when(
between(Systolic, BP_systolic_limits[[1]][1], BP_systolic_limits[[1]][2]) &
between(Diastolic, BP_diastolic_limits[[1]][1], BP_diastolic_limits[[1]][2]) ~ BP_ranges[1],
between(Systolic, BP_systolic_limits[[2]][1], BP_systolic_limits[[2]][2]) &
between(Diastolic, BP_diastolic_limits[[2]][1], BP_diastolic_limits[[2]][2]) ~ BP_ranges[2],
between(Systolic, BP_systolic_limits[[3]][1], BP_systolic_limits[[3]][2]) &
between(Diastolic, BP_diastolic_limits[[3]][1], BP_diastolic_limits[[3]][2]) ~ BP_ranges[3],
between(Systolic, BP_systolic_limits[[4]][1], BP_systolic_limits[[4]][2]) &
between(Diastolic, BP_diastolic_limits[[4]][1], BP_diastolic_limits[[4]][2]) ~ BP_ranges[4],
between(Systolic, BP_systolic_limits[[5]][1], BP_systolic_limits[[5]][2]) &
between(Diastolic, BP_diastolic_limits[[5]][1], BP_diastolic_limits[[5]][2]) ~ BP_ranges[5],
TRUE ~ "Unknown" # Default case
))
# New values to replace "Unknown"
replacement_values <- c(rep("Elevated", 12), rep("Hypertension_1", 2), "Elevated")
data <- data %>%
mutate(BP_range = replace(BP_range, BP_range == "Unknown", replacement_values)) %>%
mutate(Heart.Rate = as.factor(Heart.Rate))
data$Heart.Rate <- as.numeric(as.character(data$Heart.Rate))
# Group by BP_range and Heart.Rate, then count occurrences
data_grouped <- data %>%
group_by(BP_range, Heart.Rate, Sleep.Disorder, Age) %>%
summarise(Count = n(), .groups = "drop") # Count occurrences of each Heart Rate
# Convert BP_range into a factor with the specified order
data_grouped$BP_range <- factor(data_grouped$BP_range, levels = BP_ranges)
blood_bar <- ggplot(data_grouped, aes(x = BP_range, y = Count, fill = Heart.Rate)) +
geom_col() +
scale_fill_viridis_c(option = "magma", direction = -1) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))+
ggtitle("Heart rate of \n of blood pressure ranges") +
theme(plot.title = element_text(hjust=0.5))
blood_bar
We separated the “Blood.Pressure” column into “Systolic” (#/) and “Diastolic” (/#) categories and then depending on their combination create a categorical blood pressure range column: “BP_range” from the American Heart Association . We manually clean up the “Unknown” ranges by making assumptions (putting more importance on the systolic number) and assigning them a category.
As expected, we can see an increase in heart rate with each blood pressure range.
data_grouped$Sleep.Disorder = factor(data_grouped$Sleep.Disorder, levels =
c('None','Insomnia','Sleep Apnea'))
blood_sleep <- ggplot(data_grouped, aes(x=BP_range, y=Heart.Rate)) +
geom_jitter(aes(fill=Sleep.Disorder,size=Age), color='black',
shape=21, stroke=0.4, width=0.3) +
scale_fill_brewer(palette="Reds") +
ggtitle("Effect of BP range, Heart Rate \n and Age on Sleep") +
theme(plot.title = element_text(hjust=0.5))
blood_sleep
We see an evident correlation between hypertension and sleep disorders.
Regardless of age, a lower heart rate and a normal BP range indicates a
lower propensity to sleep disorders, while the hypertension I and II
have a higher proportion of insomnia and sleep apnea, with hypertension
II being almost exclusively being sleep apnea. An elevated BP does not
seem correlated with sleep disorders.
data <- data %>%
mutate(Stress.Level = as.factor(Stress.Level))
act_bar <- ggplot(data, aes(x=Physical.Activity.Level,y=Daily.Steps)) +
geom_jitter(aes(size=Age, fill=Stress.Level), color='black', shape=21, stroke=0.4) +
scale_fill_brewer(palette="Blues") +
geom_smooth(method=lm, linetype="dashed",
color="darkblue")
act_stress_bar <- ggplot(data, aes(x=Physical.Activity.Level,y=Daily.Steps)) +
geom_jitter(aes(fill=Sleep.Disorder, size=Age), color='black', shape=21, stroke=0.4) +
scale_fill_brewer(palette="Reds") +
geom_smooth(method=lm, linetype="dashed",
color="darkred")
act_bar + act_stress_bar + plot_layout(widths = c(1, 1))
We see a really strong correlation between physical activity and daily steps with the stress level, the lower the physical activity and daily steps, the higher the stress level the person feels regardless of their age. We see a couple of outliers where the stress levels are high at the top right (highest number of steps with highest activity level). As seen by the following table:
data %>%
group_by(Occupation, Stress.Level) %>%
summarise(avg_steps = mean(Daily.Steps),
avg_activity = mean(Physical.Activity.Level))
## # A tibble: 32 × 4
## # Groups: Occupation [11]
## Occupation Stress.Level avg_steps avg_activity
## <chr> <fct> <dbl> <dbl>
## 1 Accountant 3 7500 80
## 2 Accountant 4 7000 60
## 3 Accountant 6 7200 53.3
## 4 Accountant 7 6000 45
## 5 Doctor 3 6850 87.5
## 6 Doctor 5 3500 65
## 7 Doctor 6 8000 75
## 8 Doctor 8 5848. 31.8
## 9 Engineer 3 5176. 30.9
## 10 Engineer 4 5644. 63.9
## # ℹ 22 more rows
The higher number of steps could be a result of the job and not a result of leisure sport, especially with nurses who have the highest number of steps over all the other occupations (8058 steps).
Insomnia seems to be correlated with a lower level of physical, while sleep apnea does not follow this pattern. The latter one seems to mainly affect older people regardless of the physical activity.
I
sleep_q <- ggplot(data, aes(x=Sleep.Duration,y=Quality.of.Sleep)) +
geom_jitter(aes(fill=Sleep.Disorder), color='black',
shape=21, stroke=0.4, size=3, width=0.5) +
scale_fill_brewer(palette="Reds")
sleep_q
The goal of this study case was to use data analytics to derive insights, however I want to also apply some machine learning to obtain an order of feature importance and to see if this data is enough to predict a sleep disorder